Hyper log log plus plus (HLL++) #2522
base: branch-25.02
Conversation
b6f5cf5 to 526a61f
src/main/cpp/src/HLLPP.cu
Outdated
rmm::cuda_stream_view stream, | ||
rmm::device_async_resource_ref mr) | ||
{ | ||
CUDF_EXPECTS(precision >= 4 && precision <= 18, "HLL++ requires precision in range: [4, 18]"); |
We can use std::numeric_limits<>::digits instead of hardcoded values 4 and 18.
src/main/cpp/src/HLLPP.cu
Outdated
auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children()); | ||
auto d_inputs = cudf::detail::make_device_uvector_async(input_cols, stream, mr); | ||
auto result = cudf::make_numeric_column( | ||
cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream); |
Do we need such an all-valid null mask? How about cudf::mask_state::UNALLOCATED?
Tested Spark behavior: approx_count_distinct(null) returns 0, so the values in the result column are always non-null.
I meant, if all rows are valid, we don't need to allocate a null mask.
BTW, we need to pass mr to the returning column (but do not pass it to the intermediate vector/column).
cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream); | |
cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::UNALLOCATED, stream, mr); |
Done.
src/main/cpp/src/HLLPP.cu
Outdated
auto result = cudf::make_numeric_column( | ||
cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream); | ||
// evaluate from struct<long, ..., long> | ||
thrust::for_each_n(rmm::exec_policy(stream), |
Try to use exec_policy_nosync as much as possible.
thrust::for_each_n(rmm::exec_policy(stream), | |
thrust::for_each_n(rmm::exec_policy_nosync(stream), |
Done.
* The input sketch values must be given in the format `LIST<INT8>`. | ||
* | ||
* @param input The sketch column which constains `LIST<INT8> values. |
INT8 or INT64?
In addition, in estimate_from_hll_sketches I see that the input is STRUCT<LONG, LONG, ....> instead of LIST<>. Why?
It's STRUCT<LONG, LONG, ....>, consistent with Spark. The input is columnar data, e.g.: sketch 0 is composed of all the data of the children at index 0.
Updated the function comments, refer to commit
Signed-off-by: Chong Gao <[email protected]>
Ready for review except test cases.
src/main/cpp/CMakeLists.txt
Outdated
@@ -196,6 +196,7 @@ add_library( | |||
src/HashJni.cpp | |||
src/HistogramJni.cpp | |||
src/HostTableJni.cpp | |||
src/HLLPPJni.cpp |
Let's try to be generic.
src/HLLPPJni.cpp | |
src/AggregationJni.cpp |
Renamed to HLLPPHostUDFJni; AggregationJni is too generic.
src/main/cpp/CMakeLists.txt
Outdated
@@ -204,6 +205,7 @@ add_library( | |||
src/SparkResourceAdaptorJni.cpp | |||
src/SubStringIndexJni.cpp | |||
src/ZOrderJni.cpp | |||
src/HLLPP.cu |
How about HyperLogLogPP?
src/HLLPP.cu | |
src/HyperLogLogPP.cu |
This name also applies to the .hpp and *.java files.
Done.
src/main/cpp/src/HLLPP.cu
Outdated
@@ -0,0 +1,102 @@ | |||
/* | |||
* Copyright (c) 2023-2024, NVIDIA CORPORATION. |
* Copyright (c) 2023-2024, NVIDIA CORPORATION. | |
* Copyright (c) 2024-2025, NVIDIA CORPORATION. |
Done.
src/main/cpp/src/HLLPP.cu
Outdated
int64_t shift_mask = MASK << (REGISTER_VALUE_BITS * reg_idx); | ||
int64_t v = (long_10_registers & shift_mask) >> (REGISTER_VALUE_BITS * reg_idx); |
int64_t shift_mask = MASK << (REGISTER_VALUE_BITS * reg_idx); | |
int64_t v = (long_10_registers & shift_mask) >> (REGISTER_VALUE_BITS * reg_idx); | |
auto const shift_bits = REGISTER_VALUE_BITS * reg_idx; | |
auto const shift_mask = MASK << shift_bits; | |
auto const v = (long_10_registers & shift_mask) >> shift_bits; |
Done.
src/main/cpp/src/HLLPP.cu
Outdated
} | ||
struct estimate_fn { | ||
cudf::device_span<int64_t const*> sketch_longs; |
cudf::device_span<int64_t const*> sketch_longs; | |
cudf::device_span<int64_t const*> sketches; |
done
src/main/cpp/src/HLLPP.cu
Outdated
int const precision; | ||
int64_t* const out; |
We now favor non-const members so the functor can be moved by the compiler if needed.
In addition, member variables need to be sorted by their sizes to reduce padding.
int const precision; | |
int64_t* const out; | |
int64_t* out; | |
int precision; |
done
src/main/cpp/src/HLLPP.cu
Outdated
__device__ void operator()(cudf::size_type const idx) const | ||
{ | ||
auto const num_regs = 1ull << precision; |
This seems to be used to compare with signed int later, thus it should not be unsigned here.
auto const num_regs = 1ull << precision; | |
auto const num_regs = 1 << precision; |
done
src/main/cpp/src/HLLPP.cu
Outdated
rmm::cuda_stream_view stream, | ||
rmm::device_async_resource_ref mr) | ||
{ | ||
CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision is bigger than 4."); |
CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision is bigger than 4."); | |
CUDF_EXPECTS(precision >= 4, "HyperLogLogPlusPlus requires precision bigger than 4."); |
done
src/main/cpp/src/HLLPP.cu
Outdated
auto const input_iter = cudf::detail::make_counting_transform_iterator( | ||
0, [&](int i) { return input.child(i).begin<int64_t>(); }); |
We need a CUDF_EXPECTS to check for input type too (struct of longs).
done.
Now all the outer functions check:
CUDF_EXPECTS(input.type().id() == cudf::type_id::STRUCT,
"HyperLogLogPlusPlus buffer type must be a STRUCT of long columns.");
for (auto i = 0; i < input.num_children(); i++) {
CUDF_EXPECTS(input.child(i).type().id() == cudf::type_id::INT64,
"HyperLogLogPlusPlus buffer type must be a STRUCT of long columns.");
}
build
Verified Host UDF successfully via NVIDIA/spark-rapids#11638
Need to wait for the dependencies to be merged first before we can build.
int64_t const precision, // num of bits for register addressing, e.g.: 9 | ||
int* const registers_output_cache, // num is num_groups * num_registers_per_sketch | ||
int* const registers_thread_cache, // num is num_threads * num_registers_per_sketch | ||
cudf::size_type* const group_lables_thread_cache // save the group lables for each thread |
nit: labels?
done
* sketch. Input is a struct column with multiple long columns which is | ||
* consistent with Spark. Output is a struct scalar with multiple long values. | ||
*/ | ||
Reduction_MERGE(1), |
Naming convention should be consistent with GroupByMerge.
Reduction_MERGE(1), | |
ReductionMerge(1), |
done
/** | ||
* HyperLogLogPlusPlus(HLLPP) host UDF aggregation utils | ||
*/ | ||
public class HyperLogLogPlusPlusHostUDF { |
Now the Java interface has changed. Please reimplement this similar to https://github.com/NVIDIA/spark-rapids-jni/pull/2631/files#diff-3bf8ba05afd52e4ef36fa2c0431304bbc88bc07cafd976665e17113464811392R24-R41.
switch (agg_type) { | ||
case 0: return spark_rapids_jni::create_hllpp_reduction_host_udf(precision); | ||
case 1: return spark_rapids_jni::create_hllpp_reduction_merge_host_udf(precision); | ||
case 2: return spark_rapids_jni::create_hllpp_groupby_host_udf(precision); | ||
default: return spark_rapids_jni::create_hllpp_groupby_merge_host_udf(precision); |
switch (agg_type) { | |
case 0: return spark_rapids_jni::create_hllpp_reduction_host_udf(precision); | |
case 1: return spark_rapids_jni::create_hllpp_reduction_merge_host_udf(precision); | |
case 2: return spark_rapids_jni::create_hllpp_groupby_host_udf(precision); | |
default: return spark_rapids_jni::create_hllpp_groupby_merge_host_udf(precision); | |
switch (agg_type) { | |
case 0: return spark_rapids_jni::create_hllpp_reduction_host_udf(precision); | |
case 1: return spark_rapids_jni::create_hllpp_reduction_merge_host_udf(precision); | |
case 2: return spark_rapids_jni::create_hllpp_groupby_host_udf(precision); | |
case 3: return spark_rapids_jni::create_hllpp_groupby_merge_host_udf(precision); | |
default: CUDF_FAIL("Invalid aggregation type."); |
done
/** | ||
* The number of bits that is required for a HLLPP register value. | ||
* | ||
* This number is determined by the maximum number of leading binary zeros a | ||
* hashcode can produce. This is equal to the number of bits the hashcode | ||
* returns. The current implementation uses a 64-bit hashcode, this means 6-bits | ||
* are (at most) needed to store the number of leading zeros. | ||
*/ | ||
constexpr int REGISTER_VALUE_BITS = 6; | ||
// MASK binary 6 bits: 111-111 | ||
constexpr uint64_t MASK = (1L << REGISTER_VALUE_BITS) - 1L; | ||
// This value is 10, one long stores 10 register values | ||
constexpr int REGISTERS_PER_LONG = 64 / REGISTER_VALUE_BITS; | ||
// XXHash seed, consistent with Spark | ||
constexpr int64_t SEED = 42L; | ||
// max precision, if require a precision bigger than 18, then use 18. | ||
constexpr int MAX_PRECISION = 18; |
Do these values need to be exposed publicly? Otherwise, please move them to the source file.
done, moved to source file.
template <typename cudf_aggregation> | ||
struct hllpp_udf : cudf::host_udf_base { | ||
static_assert(std::is_same_v<cudf_aggregation, cudf::reduce_aggregation> || | ||
std::is_same_v<cudf_aggregation, cudf::groupby_aggregation>); |
template <typename cudf_aggregation> | |
struct hllpp_udf : cudf::host_udf_base { | |
static_assert(std::is_same_v<cudf_aggregation, cudf::reduce_aggregation> || | |
std::is_same_v<cudf_aggregation, cudf::groupby_aggregation>); | |
struct hllpp_udf : cudf::groupby_host_udf, cudf::reduce_host_udf { |
done.
[[nodiscard]] output_type operator()(host_udf_input const& udf_input, | ||
rmm::cuda_stream_view stream, | ||
rmm::device_async_resource_ref mr) const override | ||
{ | ||
if constexpr (std::is_same_v<cudf_aggregation, cudf::reduce_aggregation>) { |
With the new interfaces, this needs to be separated into two operator() functions for reduction/groupby.
[[nodiscard]] output_type get_empty_output( | ||
[[maybe_unused]] std::optional<cudf::data_type> output_dtype, | ||
rmm::cuda_stream_view stream, | ||
rmm::device_async_resource_ref mr) const override | ||
{ | ||
int num_registers = 1 << precision; | ||
int num_long_cols = num_registers / REGISTERS_PER_LONG + 1; | ||
auto const results_iter = cudf::detail::make_counting_transform_iterator( | ||
0, [&](int i) { return cudf::make_empty_column(cudf::data_type{cudf::type_id::INT64}); }); | ||
auto children = | ||
std::vector<std::unique_ptr<cudf::column>>(results_iter, results_iter + num_long_cols); | ||
if constexpr (std::is_same_v<cudf_aggregation, cudf::reduce_aggregation>) { | ||
// reduce | ||
auto host_results_view_iter = thrust::make_transform_iterator( | ||
children.begin(), [](auto const& results_column) { return results_column->view(); }); | ||
auto views = std::vector<cudf::column_view>(host_results_view_iter, | ||
host_results_view_iter + num_long_cols); | ||
auto table_view = cudf::table_view{views}; | ||
auto table = cudf::table(table_view); | ||
return std::make_unique<cudf::struct_scalar>(std::move(table), true, stream, mr); | ||
} else { | ||
// groupby | ||
return cudf::make_structs_column(0, | ||
std::move(children), | ||
0, // null count | ||
rmm::device_buffer{}, // null mask | ||
stream, | ||
mr); | ||
} | ||
} |
[[nodiscard]] output_type get_empty_output( | |
[[maybe_unused]] std::optional<cudf::data_type> output_dtype, | |
rmm::cuda_stream_view stream, | |
rmm::device_async_resource_ref mr) const override | |
{ | |
int num_registers = 1 << precision; | |
int num_long_cols = num_registers / REGISTERS_PER_LONG + 1; | |
auto const results_iter = cudf::detail::make_counting_transform_iterator( | |
0, [&](int i) { return cudf::make_empty_column(cudf::data_type{cudf::type_id::INT64}); }); | |
auto children = | |
std::vector<std::unique_ptr<cudf::column>>(results_iter, results_iter + num_long_cols); | |
if constexpr (std::is_same_v<cudf_aggregation, cudf::reduce_aggregation>) { | |
// reduce | |
auto host_results_view_iter = thrust::make_transform_iterator( | |
children.begin(), [](auto const& results_column) { return results_column->view(); }); | |
auto views = std::vector<cudf::column_view>(host_results_view_iter, | |
host_results_view_iter + num_long_cols); | |
auto table_view = cudf::table_view{views}; | |
auto table = cudf::table(table_view); | |
return std::make_unique<cudf::struct_scalar>(std::move(table), true, stream, mr); | |
} else { | |
// groupby | |
return cudf::make_structs_column(0, | |
std::move(children), | |
0, // null count | |
rmm::device_buffer{}, // null mask | |
stream, | |
mr); | |
} | |
} | |
[[nodiscard]] std::unique_ptr<cudf::column> get_empty_output( | |
rmm::cuda_stream_view stream, | |
rmm::device_async_resource_ref mr) const override | |
{ | |
int num_registers = 1 << precision; | |
int num_long_cols = num_registers / REGISTERS_PER_LONG + 1; | |
auto const results_iter = cudf::detail::make_counting_transform_iterator( | |
0, [&](int i) { return cudf::make_empty_column(cudf::data_type{cudf::type_id::INT64}); }); | |
auto children = | |
std::vector<std::unique_ptr<cudf::column>>(results_iter, results_iter + num_long_cols); | |
return cudf::make_structs_column(0, | |
std::move(children), | |
0, // null count | |
rmm::device_buffer{}, // null mask | |
stream, | |
mr); | |
} |
The interface for
[[nodiscard]] input_data_attributes get_required_data() const override | ||
{ | ||
if constexpr (std::is_same_v<cudf_aggregation, cudf::reduce_aggregation>) { | ||
return {reduction_data_attribute::INPUT_VALUES}; | ||
} else { | ||
return {groupby_data_attribute::GROUPED_VALUES, | ||
groupby_data_attribute::GROUP_OFFSETS, | ||
groupby_data_attribute::GROUP_LABELS}; | ||
} | ||
} |
The new interface does not have this function anymore.
[[nodiscard]] input_data_attributes get_required_data() const override | |
{ | |
if constexpr (std::is_same_v<cudf_aggregation, cudf::reduce_aggregation>) { | |
return {reduction_data_attribute::INPUT_VALUES}; | |
} else { | |
return {groupby_data_attribute::GROUPED_VALUES, | |
groupby_data_attribute::GROUP_OFFSETS, | |
groupby_data_attribute::GROUP_LABELS}; | |
} | |
} |
auto const& input_values = | ||
std::get<cudf::column_view>(udf_input.at(reduction_data_attribute::INPUT_VALUES)); |
With the new interface, the input column is passed as a function parameter.
auto const& input_values = | |
std::get<cudf::column_view>(udf_input.at(reduction_data_attribute::INPUT_VALUES)); |
auto const& group_values = | ||
std::get<cudf::column_view>(udf_input.at(groupby_data_attribute::GROUPED_VALUES)); |
auto const& group_values = | |
std::get<cudf::column_view>(udf_input.at(groupby_data_attribute::GROUPED_VALUES)); | |
auto const group_values = get_grouped_values(); |
auto const group_offsets = std::get<cudf::device_span<cudf::size_type const>>( | ||
udf_input.at(groupby_data_attribute::GROUP_OFFSETS)); |
auto const group_offsets = std::get<cudf::device_span<cudf::size_type const>>( | |
udf_input.at(groupby_data_attribute::GROUP_OFFSETS)); | |
auto const group_offsets = get_group_offsets(); |
auto const group_labels = std::get<cudf::device_span<cudf::size_type const>>( | ||
udf_input.at(groupby_data_attribute::GROUP_LABELS)); |
auto const group_labels = std::get<cudf::device_span<cudf::size_type const>>( | |
udf_input.at(groupby_data_attribute::GROUP_LABELS)); | |
auto const group_labels = get_group_labels(); |
} | ||
/** | ||
* @brief create an empty struct scalar |
* @brief create an empty struct scalar | |
* @brief Create an empty column when the input is empty. |
I didn't read through all of the code, and I don't know C++ well enough to feel like it would be good for me to review it. But we do need some kind of test added here to at least show that the code is minimally working. We should not rely only on spark-rapids to verify the code.
}(); | ||
CUDF_EXPECTS(udf_ptr != nullptr, "Invalid HyperLogLogPlusPlus(HLLPP) UDF instance."); | ||
return reinterpret_cast<jlong>(udf_ptr.release()); |
Where is this pointer released? It looks like a memory leak.
/** | ||
* Create a HyperLogLogPlusPlus(HLLPP) host UDF | ||
*/ | ||
public static long createHLLPPHostUDF(AggregationType type, int precision) { |
I think that this API leaks. We might want to think about how to redesign this and the java APIs to be more robust.
At a minimum we need an API that will let us free the HostUDF aggregation when we are done with it, but I really would prefer to have a class that acts as a wrapper around the long which is an AutoCloseable so we can use the withResource API in spark-rapids.
Oh I forgot about closing it. Yes, we can auto-close the wrapper class https://github.com/rapidsai/cudf/blob/bbf4f7824c23c0c482f52bafdf1ece1213da2f65/java/src/main/java/ai/rapids/cudf/HostUDFWrapper.java#L28.
I'll add a fix for it shortly.
Fix by rapidsai/cudf#17727.
Note that this class must extend HostUDFWrapper and override the required methods.
Add support for Hyper log log plus plus (HLL++)
Depends on:
- HOST_UDF aggregation for groupby rapidsai/cudf#17592
- HOST_UDF aggregation for reduction and segmented reduction rapidsai/cudf#17645

Signed-off-by: Chong Gao [email protected]